For the evaluation of the different RNA isolation kits in the first phase of exRNAQC, blood was drawn from 1 healthy volunteer. We tested 8 different kits:
Most kits allow a range of plasma input volumes. Therefore, we tested both the minimum and maximum input volume recommended by the supplier. The input volume in ml directly follows the abbreviated name in the plots in this report. This yields 17 unique combinations of kit and input volumes, with 3 technical replicates processed for every combination.
Seven performance metrics were evaluated. Kits for phase 2 were eventually selected based on transforming metrics for sensitivity and precision to robust z-scores (see Selection for phase 2).
First, basic parameters are set up in this RMarkdown, such as loading dependencies, setting paths and setting up a uniform plot structure.
Sample annotation with info about kit, used input volume, eluate volume etc.
RNA extraction Control (RC) spike-ins are added to plasma prior to RNA isolation, and Library Prep control (LP) spike-in controls to the RNA eluate prior to library prep. All spike-in RNA molecules were ordered at Integrated DNA Technologies as 5’-monophosphate and 3’OH RNA oligonucleotides, and diluted in 500 nM carrier DNA oligo (TCGAAGTATTC).
First, adapters were trimmed with cutadapt v1.18, allowing an error rate of 0.15 in the adapter. After trimming, a minimum length of 15 bases was required. Finally, reads where less than 80% of bases have a quality score of 20 or higher were filtered out.
Randomly downsample each library to the lowest number of reads (at FASTQ level) to make sure the comparison of metrics is fair. E.g. if one sample is sequenced deeper it is likely to yield more miRNAs compared to a sample that was sequenced less deep.
As the lowest number of reads after quality filtering is 1,843,262 (in MIRVE0.1 sample), we downsampled all samples to 1.5M reads.
reads_all <- read.table(paste0(data_path, "all_read_counts.txt"), header=TRUE) %>% mutate("UniqueID"=gsub("-.*$","",sample))
reads_melt <- reads_all %>% mutate(subs_lines=6000000) %>% melt(value.name = "reads")
reads_melt <- full_join(reads_melt, sample_annotation, by = c("UniqueID")) %>% #filter(variable != "clump_lines") %>%
mutate(reads=reads/4)
pd <- position_dodge(0.1)
ggplot(reads_melt,aes(x=reorder(variable, dplyr::desc(reads)),y=reads, group = UniqueID))+
geom_line(position = pd, alpha=0.6)+
geom_point(position = pd, alpha=0.8, size=1.4) +
theme_point + #coord_flip() +
#facet_wrap(~RunID, ncol = 3)
scale_y_continuous(limits = c(0,NA), breaks=c(seq(0,10000000, 2000000)), labels=full_nr) +
geom_hline(yintercept = 1500000, linetype = "dashed", color = "gray")+
labs(y="", x = "") +
#theme(axis.text.x=element_text(angle=45, hjust=1)) +
scale_x_discrete(labels = c("raw reads","adapter\n trimmed\n reads", "quality\n filtered \n reads", "reads after\n downsampling\n"), limits=c("orig_lines", "clipped_lines", "qc_lines", "subs_lines"))+
scale_color_manual(values=color_panel2)Figure 3.1: Number of reads at different stages in the data processing workflow. Raw reads: total number of reads in raw FASTQ files at start; adapter trimmed reads: number of reads after adapter trimming and requiring a minimum length of 15; quality filtered reads: number of adapter trimmed reads where at least 80% of bases have a phred score of 20 or higher; reads after downsampling: all samples were downsampled to 1.5M.
# Reads should be subsampled to: reads_melt %>% filter(variable == "pre") %>% select(reads) %>% min()
ggsave("reads_preprocessing_smallRNA.pdf", plot=ggplot2::last_plot()+labs(title="Number of reads at different stages"), path= here::here("data_output","plots"), height=4, width=6, dpi = 300, useDingbats=F)RNA spike-in reads are first isolated using a custom Python script. Non-spike reads are then mapped with Bowtie (1.2.2) and further annotated based on miRbase v22 (for miRNAs), UCSC hg38 (for tRNA) & Ensembl v91 (for other small RNAs: snoRNA, snRNA, MT_tRNA, MT_rRNA, rRNA en miscRNA).
spikes_sum <- spikes %>% gather(key="UniqueID",value="spikecounts",-"spikeID") %>%
group_by(UniqueID) %>% dplyr::summarise(counts=sum(spikecounts, na.rm=T)) %>%
#spread(key=UniqueID,value=spikesum) %>%
mutate(reads="spikes")
# add sum of spikes to the total mapped reads (these were not considered as "mapped" yet)
reads_complete <- reads %>% gather(key="UniqueID",value="counts",-"reads") %>%
rbind(., spikes_sum) %>%
spread(key=reads, value=counts) %>%
mutate(all_mapped = mapped + spikes) %>%
gather(key="reads",value="counts",-"UniqueID") %>%
mutate(percentage=counts/1500000*100) # % mapped (in total 1500000 reads)
# ggplot(reads_complete %>% filter(reads=="all_mapped"),
# aes(x=reorder(GraphKit, desc(GraphKit)),y=counts,col=RNAisolation)) +
# geom_point() +
# theme_point +
# coord_flip() +
# #theme(axis.text.x = element_text(angle=90, vjust=0.5, hjust=1)) +
# scale_y_continuous(limits=c(0,1500000),labels=full_nr) +
# scale_color_manual(values=color_panel2) +
# labs(x="",y="mapped reads")
ggplot(reads_complete %>% filter(reads=="all_mapped") %>% left_join(sample_annotation, by="UniqueID"),
aes(x=reorder(GraphKit, desc(GraphKit)),y=percentage)) +
geom_point()+
theme_point +
coord_flip() +
#theme(axis.text.x = element_text(angle=90, vjust=0.5, hjust=1)) +
scale_y_continuous(limits=c(0,100),labels=full_nr) +
scale_color_manual(values=color_panel2) +
labs(x="",y="")Figure 3.2: Differences in mapping percentages. How many % of the 1.5M reads collectively map to spikes, miRbase v22 (for miRNAs), UCSC hg38 (for tRNA) & Ensembl v91 (for other small RNAs: snoRNA, snRNA, MT_tRNA, MT_rRNA, rRNA en miscelaneous RNA). (CCF1: QIAamp ccfDNA/RNA kit, 1 ml input; CCF4: QIAamp ccfDNA/RNA kit, 4 ml input; CIRC0.25: plasma/serum circulating and exosomal RNA purification kit, 0.25 ml input; CIRC5: plasma/serum circulating and exosomal RNA purification kit, 5 ml input; MAP2: MagNA Pure 24 total NA isolation kit, 2 ml input; MAP4: MagNA Pure 24 total NA isolation kit, 4 ml input; MAX0.1: Maxwell RSC miRNA plasma and exosome kit, 0.1 ml input; MAX0.5: Maxwell RSC miRNA plasma and exosome kit, 0.5 ml input; MIR0.2: miRNeasy serum/plasma kit, 0.2 ml input; MIRA0.2: miRNeasy serum/plasma advanced kit, 0.2 ml input; MIRA0.6: miRNeasy serum/plasma advanced kit, 0.6 ml input; MIRV0.1: mirVana PARIS kit, 0.1 ml input; MIRV0.625: mirVana PARIS kit, 0.625 ml input; MIRVE0.1: mirVana PARIS kit with enrichment, 0.1 ml input; MIRVE0.625: mirVana PARIS kit with enrichment, 0.625 ml input; NUC0.3: NucleoSpin miRNA plasma kit, 0.3 ml; NUC0.9: NucleoSpin miRNA plasma kit, 0.9 ml input)
#miR
tmp1 <- miRNAs %>% gather(key="UniqueID",value="counts", -"MIMATID") %>%
mutate(type="miRNA") %>%
dplyr::group_by(UniqueID,type) %>%
dplyr::summarise(sum_counts = sum(counts, na.rm=T)) #calculate sum of MIMAT per sample
#contaminants
tmp2 <- gather(contam, key="UniqueID",value="counts",-c("contam","ensembl_id")) %>%
dplyr::select(c("UniqueID","type"="contam","counts")) %>%
mutate(type = gsub("multiple_misc_RNA_snRNA","multiple_snRNA_misc_RNA",type)) %>%
mutate(type = gsub("multiple_tRNA_piRNA","multiple_piRNA_tRNA",type)) %>%
dplyr::group_by(UniqueID, type) %>%
dplyr::summarise(sum_counts=sum(counts,na.rm=T))
#join them together and calculate %
tmp <- rbind(tmp1,tmp2)
#notann
tmp3 <- tmp %>% ungroup() %>% dplyr::group_by(UniqueID) %>%
dplyr::summarise(sum_all = sum(sum_counts, na.rm=T)) %>%
full_join(gather(reads %>% filter(reads=="mapped"), key="UniqueID", value="mapped", -"reads") %>% dplyr::select(-reads), by="UniqueID") %>%
mutate(type="not_annotated",not_ann = mapped-sum_all) %>%
dplyr::select(c("UniqueID","type","sum_counts"="not_ann")) %>% group_by(UniqueID)
perc_all <- rbind(tmp,tmp3) %>%
full_join(reads_complete %>% filter(reads=="mapped") %>%
dplyr::select(-reads), by="UniqueID") %>%
ungroup() %>% dplyr::group_by(UniqueID,type) %>%
mutate(perc=sum_counts/counts*100) %>%
left_join(sample_annotation, by="UniqueID")
ggplot(perc_all, aes(x=TechnicalReplicate, y=perc,fill=type)) +
geom_bar(stat="identity") +
facet_wrap(~GraphKit, nrow=1) +
theme_bar +
theme(legend.position = "bottom") +
scale_fill_manual(values=c("black",color_panel[-3])) +
labs(y="%",x="")Figure 3.3: Fraction of mapped reads for each RNA biotype. Spike-in RNAs are not included. (miRNA: microRNA; misc_RNA: miscellaneous RNA - mainly Y-RNAs; multiple_piRNA_tRNA: multimapping to piRNA and tRNA; multiple_snRNA_misc_RNA: multimapping to small nucle(ol)ar RNA and miscellaneous RNA; not_annotated: mapping to reference genome but not annotated as small RNA; piRNA: piwi-interacting RNA; rRNA: ribosomal RNA; snoRNA: small nucleolar RNA; snRNA: small nuclear RNA; tRNA: transfer RNA)
RC (RNA extraction Control) spikes were added before RNA purification (volume adapted according to plasma input volume: (1 microL RC/ 100 microL plasma)). If there are more RC spikes detected, there is less endogenous RNA in plasma (of note, we used a large plasma pool from only one donor)
LP (Library Preparation) spikes were added to the RNA eluate after RNA purification (2 microL LP/ 12 microL eluate), prior to library preparation. The more LP spikes detected, the less endogenous RNA present in the eluate. Given that we always used the same eluate volume for library prep, more LP spikes indicate a lower endogenous RNA concentration in the eluate.
Note that MIRVE0.1 has a large fraction of LP spikes, pointing to a low RNA concentration in eluate, see miRNA concentration.
spikesum<-gather(spikes, key="UniqueID",value="counts",-"spikeID") %>%
mutate(spiketype=sub(".-.*","",spikeID)) %>% group_by(UniqueID,spiketype) %>%
dplyr::summarise(spikesum=sum(counts,na.rm = TRUE))
perc_spikes<-full_join(reads_complete %>% filter(reads=="all_mapped"), spikesum,by="UniqueID") %>%
dplyr::select(-reads) %>% drop_na()
perc_spikes<-perc_spikes %>% dplyr::mutate(perc=spikesum/counts*100)
#write_tsv(perc_spikes,"percentage_spikes.txt")
perc_spikes<-left_join(perc_spikes,sample_annotation, by="UniqueID")
ggplot(perc_spikes,aes(x=TechnicalReplicate,y=perc,fill=spiketype))+
geom_col()+
#labs(title="% spikes of all mapped reads")+
facet_wrap(~GraphKit, nrow=1) +
theme_bar+
theme(legend.position = "bottom") +
scale_y_continuous(expand = c(0, 0)) +
labs(x="",y="%")+
scale_fill_manual(values=color_panel_spikes2)(#fig:spike_perc)Percentage of mapped reads that goes to spike-in RNAs. LP (Library Preparation) spikes were added prior to library preparation, after RNA purification (blue). RC (RNA extraction Control) spikes were added prior to RNA purification (yellow). (CCF1: QIAamp ccfDNA/RNA kit, 1 ml input; CCF4: QIAamp ccfDNA/RNA kit, 4 ml input; CIRC0.25: plasma/serum circulating and exosomal RNA purification kit, 0.25 ml input; CIRC5: plasma/serum circulating and exosomal RNA purification kit, 5 ml input; MAX0.1: Maxwell RSC miRNA plasma and exosome kit, 0.1 ml input; MAX0.5: Maxwell RSC miRNA plasma and exosome kit, 0.5 ml input; MIR0.2: miRNeasy serum/plasma kit, 0.2 ml input; MIRA0.2: miRNeasy serum/plasma advanced kit, 0.2 ml input; MIRA0.6: miRNeasy serum/plasma advanced kit, 0.6 ml input; MIRV0.1: mirVana PARIS kit, 0.1 ml input; MIRV0.625: mirVana PARIS kit, 0.625 ml input; MIRVE0.1: mirVana PARIS kit with enrichment, 0.1 ml input; MIRVE0.625: mirVana PARIS kit with enrichment, 0.625 ml input; NUC0.3: NucleoSpin miRNA plasma kit, 0.3 ml; NUC0.9: NucleoSpin miRNA plasma kit, 0.9 ml input)
Spike counts are proportional to their relative concentration in the mix
ggplot(left_join(gather(spikes, key="UniqueID",value="counts",-"spikeID"),spikeconc) %>% left_join(sample_annotation, by="UniqueID"),aes(x=log(relconc,10),y=log(counts,10),color=sub(".-.*","",spikeID)))+
geom_point(size=0.8, alpha=0.5)+
facet_wrap(~GraphKit)+
#theme_point +
theme_classic() +
theme(legend.position = "top", legend.title = element_blank()) +
scale_color_manual(values=color_panel_spikes2) +
guides(color = guide_legend(override.aes = list(alpha = 1, size=1))) +
labs(x="log10(relative spike conc)",y="log10(spike counts)")Figure 3.4: Spike counts are proportional to their relative concentration in the mix. LP (Library Preparation) spikes were added prior to library preparation, after RNA purification (blue). RC (RNA extraction control) spikes were added prior to RNA purification (blue). (CCF1: QIAamp ccfDNA/RNA kit, 1 ml input; CCF4: QIAamp ccfDNA/RNA kit, 4 ml input; CIRC0.25: plasma/serum circulating and exosomal RNA purification kit, 0.25 ml input; CIRC5: plasma/serum circulating and exosomal RNA purification kit, 5 ml input; MAX0.1: Maxwell RSC miRNA plasma and exosome kit, 0.1 ml input; MAX0.5: Maxwell RSC miRNA plasma and exosome kit, 0.5 ml input; MIR0.2: miRNeasy serum/plasma kit, 0.2 ml input; MIRA0.2: miRNeasy serum/plasma advanced kit, 0.2 ml input; MIRA0.6: miRNeasy serum/plasma advanced kit, 0.6 ml input; MIRV0.1: mirVana PARIS kit, 0.1 ml input; MIRV0.625: mirVana PARIS kit, 0.625 ml input; MIRVE0.1: mirVana PARIS kit with enrichment, 0.1 ml input; MIRVE0.625: mirVana PARIS kit with enrichment, 0.625 ml input; NUC0.3: NucleoSpin miRNA plasma kit, 0.3 ml; NUC0.9: NucleoSpin miRNA plasma kit, 0.9 ml input)
Seven performance metrics are calculated.
In order to compare different kits in a uniform way across the different performance metrics, we transformed these into z-scores. We made sure that a higher value always corresponds to better performance. To account for the small sample size, we calculated robust z-scores.
#normal z-score calculation with: scale(data$measurevar, center=T, scale=T)
#function to calculate robust z-scores:
robzscore <- function(data, measurevar){
require(dplyr)
median_data <- median(pull(data,paste(measurevar))) #median
#MAD <- median(abs((pull(data, paste(measurevar))) - median_data)) #mean absolute deviation
s <- stats::mad(pull(data, paste(measurevar))) #scaling factor; mad = formula to calculate median absolute deviation which contains scaling factor of 1.4826!
if (s == 0) { #if MAD = 0, scaling factor = 0 so we get a denominator of 0 -> alternative scaling factor needed
mean_data <- mean(pull(data, paste(measurevar))) #mean
meanAD <- mean(abs((pull(data, paste(measurevar))) - mean_data))
s <- 1.253314*meanAD #alternative scaling factor, approximately equals the standard deviation
}
robz <- (pull(data, paste(measurevar)) - median_data) / (s) #calculate absolute difference between every value and median, and divide by scaling factor
}
#initiate z-score tables (z-score for each individual data point, later, we will use the median/mean per GraphKit)
z_indiv <- dplyr::select(sample_annotation, c("GraphKit", "SampleID"))
z_indiv_robust <- dplyr::select(sample_annotation, c("GraphKit", "SampleID"))The ratio of endogenous miRNA to LP count fractions reflects the relative concentration of endogenous miRNA in the eluate: if there is more endogenous miRNA in the input for the library prep, there will be less LPs and therefore the ratio endogenous miRNA/LP is higher. Remember that some kits have a much larger eluate volume after RNA isolation. A larger total eluate volume results in more diluted endogenous RNA (lower concentration) and therefore less endogenous RNA in library prep (given constant input volume for all library preparations).
Scoring measure: the higher the concentration, the better
#kallisto_genes <- kallisto %>% filter(!str_detect(ensembl_transcript_id,"LP|R1|R2")) %>% group_by(ensembl_gene_id) %>% summarise_if(is.numeric, sum, na.rm=TRUE)
gene_level_ratios <- rbind(reads %>% filter(reads=="mapped_miR") %>%
mutate(type="endogenous") %>% group_by(type) %>%
dplyr::summarise_if(is.numeric, sum, na.rm=TRUE),
spikes %>% mutate(type=gsub(".-..$","",spikeID)) %>%
group_by(type) %>%
dplyr::summarise_if(is.numeric, sum, na.rm=TRUE)) %>%
gather(., key="UniqueID",value="counts",-type) %>% #long format
spread(., key = "type", value="counts") %>%
mutate("LPvsEndo"=LP/endogenous,
"RCvsEndo"=RC/endogenous,
"EndovsLP"= endogenous/LP) %>%
left_join(., sample_annotation %>% dplyr::select(c("UniqueID","RNAisolation","SampleID","GraphKit","EluateV","PlasmaInputV")), by="UniqueID") #add annotation
# spikes1 <- ggplot(gene_level_ratios, aes(x=GraphKit, y=EndovsLP, fill=RNAisolation, colour=RNAisolation)) +
# #geom_bar(stat="identity") +
# geom_point(size=1.2) +
# scale_fill_manual(values=color_panel) +
# scale_colour_manual(values=color_panel) +
# theme_bar +
# labs(x="", y="endogenous/LP", title="RNA concentration", subtitle="Endogenous RNA vs LP")#calculate statistics for endogenous/LP (sd, sem, 95% ci)
conc <- log_rescaling_CI(gene_level_ratios, measurevar="EndovsLP", groupvar=c("GraphKit")) %>%
mutate(measurevar_resc_oriscale = 10^measurevar_log_resc)
# Plot LP/endo in log10 scale
spikes_conc <- ggplot(conc, aes(x=reorder(GraphKit, desc(GraphKit)), y=measurevar_resc_oriscale)) +
#geom_bar(position=position_dodge(), stat="identity")+
geom_errorbar(aes(ymin=ci_lower_oriscale, ymax=ci_upper_oriscale), colour="grey", width=.1) +
geom_point(aes(colour=RNAisolation), size=1.2) +
geom_point(data=conc, aes(x=GraphKit, y=mean_oriscale), shape=4, colour="grey") +
geom_hline(yintercept = 1, linetype="dashed",colour="grey88") +
labs(x="", y="relative miRNA concentration") +
scale_colour_manual(values=color_panel2) +
scale_y_log10() +
theme_point +
coord_flip()
spikes_conc + theme(legend.position="none")Figure 4.1: Relative miRNA concentration in eluate after RNA purification. Concentration: ratio of endogenous miRNA to LP spikes. Values were first log transformed and rescaled to the average of MIRVE0.1, then transformed back to linear scale. Mean per kit (cross) and 95% confidence intervals shown (grey lines). (CCF1: QIAamp ccfDNA/RNA kit, 1 ml input; CCF4: QIAamp ccfDNA/RNA kit, 4 ml input; CIRC0.25: plasma/serum circulating and exosomal RNA purification kit, 0.25 ml input; CIRC5: plasma/serum circulating and exosomal RNA purification kit, 5 ml input; MAX0.1: Maxwell RSC miRNA plasma and exosome kit, 0.1 ml input; MAX0.5: Maxwell RSC miRNA plasma and exosome kit, 0.5 ml input; MIR0.2: miRNeasy serum/plasma kit, 0.2 ml input; MIRA0.2: miRNeasy serum/plasma advanced kit, 0.2 ml input; MIRA0.6: miRNeasy serum/plasma advanced kit, 0.6 ml input; MIRV0.1: mirVana PARIS kit, 0.1 ml input; MIRV0.625: mirVana PARIS kit, 0.625 ml input; MIRVE0.1: mirVana PARIS kit with enrichment, 0.1 ml input; MIRVE0.625: mirVana PARIS kit with enrichment, 0.625 ml input; NUC0.3: NucleoSpin miRNA plasma kit, 0.3 ml; NUC0.9: NucleoSpin miRNA plasma kit, 0.9 ml input)
ggsave("concentration_smallRNA.pdf", plot=spikes_conc, path= here::here("data_output","plots"), height=4, width=6, dpi = 300, useDingbats=F)# normal z-score calculation
## calculate individual z-score for all data points
tmp <- cbind("SampleID"=paste(conc$SampleID),
"GraphKit"=paste(conc$GraphKit),
"concentration"=paste(scale(conc$measurevar_resc_oriscale, center=T, scale=T))) %>% as.tibble(.) %>%
type_convert(.) #convert columns with all numbers to type "dbl"
## add these to the z-score df
z_indiv <- left_join(z_indiv, tmp, by=c("SampleID","GraphKit"))
rm(tmp)
# robust z-score calculation
## Calculate robust z scores to account for small sample sizes
robust_z <- robzscore(conc, "measurevar_resc_oriscale")
tmp <- cbind(GraphKit = paste(conc$GraphKit), SampleID=paste(conc$SampleID), concentration=robust_z) %>% as.tibble(.) %>% type_convert(.)
## add these to robust z-score df
z_indiv_robust <- left_join(z_indiv_robust, tmp, by=c("SampleID","GraphKit"))
rm(tmp)
## variable correlation
#cor_z_scores <- cor(z_indiv$concentration, z_indiv_robust$concentration, method = "spearman")
rm(conc)For small RNA sequencing purposes, we are most interested in the concentration of the eluate as we can only use a limited volume for library prep. However, by multiplying the relative miRNA concentrations above with the total eluate volume, we obtain the relative miRNA yield in the eluate after RNA isolation. In case the total eluate volume is larger, the RNA will be more diluted (this is for example the case for MIRV: 100 microL eluate compared to only 12 microL for CCF).
If the miRNA yield is high, but the eluate volume is large, further concentrating the total eluate before library prep may give better results . Of note, we did not evaluate this in our study. Scoring: the higher the yield, the better
# Correct LP/endogenous ratio for eluate V
gene_level_ratios <- gene_level_ratios %>%
mutate("EndovsLP_EluateCorr"= (endogenous/LP)*EluateV,
"EndovsLP_Input_EluateCorr"= (endogenous/LP)*EluateV/PlasmaInputV,
"RCvsLP_Input_EluateCorr" = (RC/LP)*EluateV/PlasmaInputV)
#calculate statistics for LP/endogenous (sd, sem, 95% ci)
yield <- log_rescaling_CI(gene_level_ratios, measurevar="EndovsLP_EluateCorr", groupvar=c("GraphKit")) %>%
mutate(measurevar_resc_oriscale = 10^measurevar_log_resc)
# Plot LP/endo in log10 scale
spikes_yield <- ggplot(yield, aes(x=reorder(GraphKit,desc(GraphKit)), y=measurevar_resc_oriscale)) +
#geom_bar(position=position_dodge(), stat="identity")+
geom_errorbar(aes(ymin=ci_lower_oriscale, ymax=ci_upper_oriscale), colour="grey", width=.1) +
geom_point(aes(colour=RNAisolation), size=1.2) +
geom_point(data=yield, aes(x=GraphKit, y=mean_oriscale), shape=4, colour="grey") +
geom_hline(yintercept = 1, linetype="dashed",colour="grey88") +
labs(x="", y="relative miRNA yield in total eluate") +
scale_colour_manual(values=color_panel2) +
scale_y_log10() +
theme_point +
coord_flip()
spikes_yield + theme(legend.position="none")Figure 4.2: Relative miRNA yield of kits. Yield: eluate volume corrected miRNA concentration. Values were first log transformed and rescaled to the average of MIRVE0.1, then transformed back to linear scale. Mean per kit (cross) and 95% confidence intervals shown (grey lines). (CCF1: QIAamp ccfDNA/RNA kit, 1 ml input; CCF4: QIAamp ccfDNA/RNA kit, 4 ml input; CIRC0.25: plasma/serum circulating and exosomal RNA purification kit, 0.25 ml input; CIRC5: plasma/serum circulating and exosomal RNA purification kit, 5 ml input; MAX0.1: Maxwell RSC miRNA plasma and exosome kit, 0.1 ml input; MAX0.5: Maxwell RSC miRNA plasma and exosome kit, 0.5 ml input; MIR0.2: miRNeasy serum/plasma kit, 0.2 ml input; MIRA0.2: miRNeasy serum/plasma advanced kit, 0.2 ml input; MIRA0.6: miRNeasy serum/plasma advanced kit, 0.6 ml input; MIRV0.1: mirVana PARIS kit, 0.1 ml input; MIRV0.625: mirVana PARIS kit, 0.625 ml input; MIRVE0.1: mirVana PARIS kit with enrichment, 0.1 ml input; MIRVE0.625: mirVana PARIS kit with enrichment, 0.625 ml input; NUC0.3: NucleoSpin miRNA plasma kit, 0.3 ml; NUC0.9: NucleoSpin miRNA plasma kit, 0.9 ml input)
#grid_arrange_shared_legend(spikes_conc,spikes_yield)
ggsave("yield_smallRNA.pdf", plot=spikes_yield, path= here::here("data_output","plots"), height=4, width=6, dpi = 300, useDingbats=F)
rm(spikes_conc, spikes_yield)# normal z-score calculation
## calculate individual z-score for all data points
tmp <- cbind("SampleID"=paste(yield$SampleID),
"GraphKit"=paste(yield$GraphKit),
"yield"=paste(scale(yield$measurevar_resc_oriscale, center=T, scale=T))) %>% as.tibble(.) %>%
type_convert(.) #convert columns with all numbers to type "dbl"
## add these to the z-score df
z_indiv <- left_join(z_indiv, tmp, by=c("SampleID","GraphKit"))
rm(tmp)
# robust z-score calculation
## Calculate robust z scores to account for small sample sizes
robust_z <- robzscore(yield,"measurevar_resc_oriscale")
tmp <- cbind(GraphKit = paste(yield$GraphKit), SampleID=paste(yield$SampleID), yield=robust_z) %>% as.tibble(.) %>% type_convert(.)
## add these to robust z-score df
z_indiv_robust <- left_join(z_indiv_robust, tmp, by=c("SampleID","GraphKit"))
rm(tmp, robust_z)
## variable correlation
#cor_z_scores <- cor(z_indiv$concentration, z_indiv_robust$concentration, method = "spearman")
rm(yield)Based on the previous plot with miRNA yield, we observe differences in RNA isolation efficiencies among kits. By correcting the yield for the plasma input volume, we obtain a better picture:
Remark: while we could also look at RC/LP ratio corrected for input and eluate volume (should give similar results), we decided to look at endogenous RNA as a more representative metric as this is the biomaterial of interest.
There is a clear difference in kit efficiency, with a difference of factor 10 or more.
If some adjustments would be made to kits with low input volume, but high efficiency (i.e. increase allowed plasma input V and keep eluate V as small as possible), the overall performance may further improve. Of note, we did not evaluate this in our study.
Scoring: the higher the efficiency, the better
#calculate statistics for LP/endogenous (sd, sem, 95% ci)
p1 <- ggplot(gene_level_ratios, aes(x=GraphKit, y=EndovsLP_Input_EluateCorr, fill=RNAisolation, colour=RNAisolation)) +
#geom_bar(stat="identity") +
geom_point(size=1.2) +
scale_fill_manual(values=color_panel2) +
scale_colour_manual(values=color_panel2) +
theme_bar +
labs(x="", y="Relative efficiency", title="Efficiency of kits")
eff <- log_rescaling_CI(gene_level_ratios, measurevar="EndovsLP_Input_EluateCorr", groupvar=c("GraphKit")) %>%
mutate(measurevar_resc_oriscale = 10^measurevar_log_resc)
p1 <- ggplot(eff, aes(x=reorder(GraphKit, dplyr::desc(GraphKit)), y=measurevar_resc_oriscale)) +
#geom_bar(position=position_dodge(), stat="identity")+
geom_errorbar(aes(ymin=ci_lower_oriscale, ymax=ci_upper_oriscale), colour="grey", width=.1) +
geom_point(aes(colour=RNAisolation), size=1.2) +
geom_point(data=eff, aes(x=GraphKit, y=mean_oriscale), shape=4, colour="grey") +
geom_hline(yintercept = 1, linetype="dashed",colour="grey88") +
labs(x="", y="relative RNA isolation efficiency") +
scale_colour_manual(values=color_panel2) +
scale_y_log10() +
coord_flip() +
theme_point
p1 + theme(legend.position="none")Figure 4.3: Relative efficiency of kits. Efficiency: plasma input volume corrected miRNA yield. Values were first log transformed and rescaled to the average of MIRVE0.625, then transformed back to linear scale. Mean per kit (cross) and 95% confidence intervals shown. (CCF1: QIAamp ccfDNA/RNA kit, 1 ml input; CCF4: QIAamp ccfDNA/RNA kit, 4 ml input; CIRC0.25: plasma/serum circulating and exosomal RNA purification kit, 0.25 ml input; CIRC5: plasma/serum circulating and exosomal RNA purification kit, 5 ml input; MAX0.1: Maxwell RSC miRNA plasma and exosome kit, 0.1 ml input; MAX0.5: Maxwell RSC miRNA plasma and exosome kit, 0.5 ml input; MIR0.2: miRNeasy serum/plasma kit, 0.2 ml input; MIRA0.2: miRNeasy serum/plasma advanced kit, 0.2 ml input; MIRA0.6: miRNeasy serum/plasma advanced kit, 0.6 ml input; MIRV0.1: mirVana PARIS kit, 0.1 ml input; MIRV0.625: mirVana PARIS kit, 0.625 ml input; MIRVE0.1: mirVana PARIS kit with enrichment, 0.1 ml input; MIRVE0.625: mirVana PARIS kit with enrichment, 0.625 ml input; NUC0.3: NucleoSpin miRNA plasma kit, 0.3 ml; NUC0.9: NucleoSpin miRNA plasma kit, 0.9 ml input)
ggsave("efficiency_endo_smallRNA.pdf", plot=p1, path= here::here("data_output","plots"), height=4, width=6, dpi = 300, useDingbats=F)
rm(p1)# normal z-score calculation
## calculate individual z-score for all data points
tmp <- cbind("SampleID"=paste(eff$SampleID),
"GraphKit"=paste(eff$GraphKit),
"efficiency"=paste(scale(eff$measurevar_resc_oriscale, center=T, scale=T))) %>% as.tibble(.) %>%
type_convert(.) #convert columns with all numbers to type "dbl"
## add these to the z-score df
z_indiv <- left_join(z_indiv, tmp, by=c("SampleID","GraphKit"))
rm(tmp)
# robust z-score calculation
## Calculate robust z scores to account for small sample sizes
## First for endogenous efficiency
robust_z <- robzscore(eff,"measurevar_resc_oriscale")
tmp <- cbind(GraphKit = paste(eff$GraphKit), SampleID=paste(eff$SampleID), efficiency=robust_z) %>% as.tibble(.) %>% type_convert(.)
## add these to robust z-score df
z_indiv_robust <- left_join(z_indiv_robust, tmp, by=c("SampleID","GraphKit"))
rm(tmp, robust_z)
## variable correlation
#cor_z_scores <- cor(z_indiv$concentration, z_indiv_robust$concentration, method = "spearman")
rm(eff)Two examples of pairwise kit-volume comparisons together with their cutoff and r-squared value (based on linear model (y=x) of log counts). Histograms show the relative number of RNAs with counts in that bin. For an overview of the cutoffs for each kit-volume combination, see next tab 95% SP elimination cutoffs.
miRNAs[is.na(miRNAs)] <- 0
double_positives<-miRNAs %>%
mutate_if(is.numeric, funs(replace(., .< 1, 0))) #remove pseudocounts
#make table for the 15 GraphKits (kit+volume) containing the pairwise combinations of replicates + amount of SP/DP/DN (before and after filtering)
single_pos <- data.frame(GraphKit = rep(unique(sample_annotation$GraphKit),3) %>% sort(),
Replicates = rep(c("RNA1-RNA2", "RNA1-RNA3","RNA2-RNA3"), length(unique(sample_annotation$GraphKit))),
SP_no_filter = NA, DP_no_filter = NA, DN_no_filter = NA,
SP_after_filter = NA, DP_after_filter = NA, DN_after_filter = NA,
filter_threshold = NA)
single_pos$full_name <- paste(single_pos$GraphKit, single_pos$Replicates, sep="_")
#for every kit+volume combination: determine the pairwise cutoffs
for (UniqueKit in unique(sample_annotation$GraphKit)){
sample_duplicates<-sample_annotation %>% filter(GraphKit==UniqueKit) %>%
dplyr::select(UniqueID,GraphKit,TechnicalReplicate)
if(nrow(sample_duplicates)>1){
#print(UniqueKit)
double_pos_sample<-double_positives %>%
dplyr::select(MIMATID,sample_duplicates$UniqueID)
samples_comb <- combn(sample_duplicates$UniqueID,2) #compare 2 of the 3 samples at a time
for (n_col in 1:ncol(samples_comb)){
#print(samples_comb[,n_col])
samplename1 <- samples_comb[,n_col][1]
sample1 <- sample_annotation[sample_annotation$UniqueID==samplename1,]$TechnicalReplicate
samplename2 <- samples_comb[,n_col][2]
sample2 <- sample_annotation[sample_annotation$UniqueID==samplename2,]$TechnicalReplicate
varname <- paste0(UniqueKit,"_",sample1,"-",sample2)
#print(varname)
double_pos_subset <- double_pos_sample %>% dplyr::select(MIMATID, paste(samplename1), paste(samplename2))
double_pos_subset$pos_type <- as.factor(
ifelse(double_pos_subset[,paste(samplename1)] > 0 &
double_pos_subset[,paste(samplename2)] > 0, "DP", #double positive
ifelse(double_pos_subset[,paste(samplename1)] == 0 &
double_pos_subset[,paste(samplename2)] ==0 , "DN", #double negative
"SP"))) # else: single positive
single_p <- double_pos_subset %>%
filter(pos_type=="SP") %>%
mutate(., max=pmax(get(samplename1), get(samplename2)))
#Threshold that removes 95% of the single positives
threshold <- round(as.numeric(paste(quantile(single_p$max,probs = 0.95, na.rm=TRUE)))+0.005, 2) #get the 95% quantile number, round UP to two decimal numbers
double_pos_subset$colouring <- as.factor(ifelse(double_pos_subset[,paste(samplename1)] > threshold, "> cutoff",
ifelse(double_pos_subset[,paste(samplename2)] > threshold, "> cutoff", "<= cutoff")))
single_pos[single_pos$full_name==paste(varname),]$filter_threshold <- threshold
#Single Positives
single_pos[single_pos$full_name==paste(varname),]$SP_no_filter <- sum(
((double_pos_subset[,paste(samplename1)] > 0) & (double_pos_subset[,paste(samplename2)] == 0)) |
((double_pos_subset[,paste(samplename1)] == 0) & (double_pos_subset[,paste(samplename2)] > 0))
)
single_pos[single_pos$full_name==paste(varname),]$SP_after_filter <- sum(
((double_pos_subset[,paste(samplename1)] > threshold) & (double_pos_subset[,paste(samplename2)] == 0)) |
((double_pos_subset[,paste(samplename1)] == 0) & (double_pos_subset[,paste(samplename2)] > threshold))
)
#Double Positives
single_pos[single_pos$full_name==paste(varname),]$DP_no_filter <- sum((double_pos_subset[,paste(samplename1)] > 0) &
(double_pos_subset[,paste(samplename2)] > 0))
single_pos[single_pos$full_name==paste(varname),]$DP_after_filter <- sum((double_pos_subset[,paste(samplename1)] > threshold) &
(double_pos_subset[,paste(samplename2)] > threshold))
#Double Negatives
single_pos[single_pos$full_name==paste(varname),]$DN_no_filter <- sum((double_pos_subset[,paste(samplename1)] == 0) &
(double_pos_subset[,paste(samplename2)] == 0))
single_pos[single_pos$full_name==paste(varname),]$DN_after_filter <- sum((double_pos_subset[,paste(samplename1)] <= threshold) &
(double_pos_subset[,paste(samplename2)] <= threshold))
#Calculate percentages of SP and DP remaining after filtering
single_pos$remainingSP <- single_pos$SP_after_filter/single_pos$SP_no_filter
single_pos$remainingDP <- single_pos$DP_after_filter/single_pos$DP_no_filter
#correlation_data <- double_pos_subset %>% dplyr::select(c(paste(samplename1),paste(samplename2))) %>% dplyr::filter_if(is.numeric,all_vars(.>0)) #take out genes that are not >0 in both samples
lm_tmp <- lm(log(pull(double_pos_sample,samplename1)+1,10) ~ -1 + log(pull(double_pos_sample, samplename2)+1,10)) # fit linear model of log values technical replicates (no intercept)
lm_rsquared <- summary(lm_tmp)$r.squared #take r-squared of lm
p <- ggplot(double_pos_subset, aes(x=log(get(samplename1)+1,10), y=log(get(samplename2)+1,10), col=colouring)) +
geom_point(alpha=0.3,size=0.4) +
#geom_hline(yintercept = log(quantile(single_pos$max,probs = 0.95, na.rm=TRUE)+1,10)) +
#geom_vline(xintercept = log(quantile(single_pos$max,probs = 0.95, na.rm=TRUE)+1,10)) +
theme_classic() +
scale_x_continuous(limits=c(0,5)) +
scale_y_continuous(limits=c(0,5)) +
theme(plot.title=element_text(size=9), plot.subtitle=element_text(size=7),
legend.title=element_blank(), legend.text=element_text(size=8), legend.position="top",
axis.title=element_text(size=8)) +
scale_color_manual(values=color_panel[2:6]) +
guides(color = guide_legend(override.aes = list(alpha = 1))) + #make sure legend color is clear
labs(title=paste(varname, ": cutoff of", threshold),
subtitle=paste0("% remaining after filtering: ",
round(single_pos[single_pos$full_name==paste(varname),]$remainingSP*100,2),"% of SP (R2 = ",
round(lm_rsquared,3), ")"), #R squared of linear model fitting log values of 2 technical replicates
x=paste0("log10(",sample1,"+1)"), y=paste0("log10(",sample2,"+1)"))
#print(p)
rm(lm_tmp,lm_rsquared)
}
}
}sample_duplicates<-sample_annotation_all %>% filter(GraphKit=="CIRC5") %>%
dplyr::select(UniqueID,GraphKit,TechnicalReplicate)
sample1 <- sample_duplicates$UniqueID[1]
sample2 <- sample_duplicates$UniqueID[3]
double_pos_sample<-double_positives %>%
dplyr::select(MIMATID,sample1, sample2)
varname = "CIRC5_RNA1-RNA3"
threshold <- as.numeric(paste(single_pos %>% filter(GraphKit=="CIRC5" & Replicates=="RNA1-RNA3") %>% dplyr::select(filter_threshold)))
double_pos_sample$colouring <- as.factor(ifelse(double_pos_sample[,paste(sample1)] > threshold, "> cutoff",
ifelse(double_pos_sample[,paste(sample2)] > threshold, "> cutoff", "<= cutoff")))
lm_tmp <- lm(log(pull(double_pos_sample,sample1)+1,10) ~ -1 + log(pull(double_pos_sample, sample2)+1,10)) # fit linear model of log values technical replicates (no intercept)
lm_rsquared <- summary(lm_tmp)$r.squared #take r-squared of lm
p <- ggplot(double_pos_sample, aes(x=log(get(sample1)+1,10), y=log(get(sample2)+1,10), col=colouring)) +
geom_point(size=0.5, alpha=0.4) +
theme_classic() +
theme(plot.title=element_text(size=9, margin=margin(1,0,0,0)), plot.subtitle=element_text(size=7),
legend.title=element_blank(), legend.text=element_text(size=8), legend.position="bottom",
axis.title=element_text(size=8)) +
scale_color_manual(values=color_panel[2:6]) +
guides(color = guide_legend(override.aes = list(alpha = 1))) + #make sure legend color is clear
scale_x_continuous(limits=c(0,5)) +
scale_y_continuous(limits=c(0,5)) +
labs(title=paste(varname, "with cutoff of", threshold),
subtitle=paste0("% remaining after filtering: ",
round(single_pos[single_pos$full_name=="CIRC5_RNA1-RNA3",]$remainingSP*100,2),"% of SP (R2 = ",
round(lm_rsquared,3), ")"), #R squared of linear model fitting log values of 2 technical replicates
x=paste0("log10(replicate1 + 1)"), y=paste0("log10(replicate3 + 1)"))
rm(lm_rsquared, lm_tmp)
ggExtra::ggMarginal(p, type = "histogram", size=7)Figure 4.4: Pairwise miRNA count comparison of first and third replicate of the CIRC5 kit. The coefficient of determination is 0.941 (linear model that fits log10 values). Single positives are 0 in one replicate and > 0 in other. The cutoff of 3 removes 97.87% of single positives. Green dots show data points that are filtered out with this cutoff. (CIRC5: plasma/serum circulating and exosomal RNA purification kit, 5 ml input)
ggsave("CIRC5_afterdedup.pdf",ggExtra::ggMarginal(p, type = "histogram", size=7), path= here::here("data_output","plots"), height=4.2, width=3.8, dpi = 300, useDingbats=F)
#write.table(x = dplyr::select(double_pos_sample, c("ensembl_gene_id", paste(sample1), paste(sample2))), file = "test_CIRC5_RNA1_3.txt",sep="\t",quote = F, na = "",row.names=F)sample_duplicates<-sample_annotation_all %>% filter(GraphKit=="MIRV0.1") %>%
dplyr::select(UniqueID,GraphKit,TechnicalReplicate)
sample1 <- sample_duplicates$UniqueID[1]
sample2 <- sample_duplicates$UniqueID[3]
double_pos_sample<-double_positives %>%
dplyr::select(MIMATID,sample1, sample2)
varname = "MIRV0.1_RNA1-RNA3"
threshold <- as.numeric(paste(single_pos %>% filter(GraphKit=="MIRV0.1" & Replicates=="RNA1-RNA3") %>% dplyr::select(filter_threshold)))
double_pos_sample$colouring <- as.factor(ifelse(double_pos_sample[,paste(sample1)] > threshold, "> cutoff",
ifelse(double_pos_sample[,paste(sample2)] > threshold, "> cutoff", "<= cutoff")))
lm_tmp <- lm(log(pull(double_pos_sample,sample1)+1,10) ~ -1 + log(pull(double_pos_sample, sample2)+1,10)) # fit linear model of log values technical replicates (no intercept)
lm_rsquared <- summary(lm_tmp)$r.squared #take r-squared of lm
p <- ggplot(double_pos_sample, aes(x=log(get(sample1)+1,10), y=log(get(sample2)+1,10), col=colouring)) +
geom_point(size=0.5, alpha=0.4) +
theme_classic() +
theme(plot.title=element_text(size=9, margin=margin(1,0,0,0)), plot.subtitle=element_text(size=7),
legend.title=element_blank(), legend.text=element_text(size=8), legend.position="top",
axis.title=element_text(size=8)) +
scale_color_manual(values=color_panel[2:6]) +
guides(color = guide_legend(override.aes = list(alpha = 1))) + #make sure legend color is clear
scale_x_continuous(limits=c(0,5)) +
scale_y_continuous(limits=c(0,5)) +
labs(title=paste(varname, "with cutoff of", threshold),
subtitle=paste0("% remaining after filtering: ",
round(single_pos[single_pos$full_name=="MIRV0.1_RNA1-RNA3",]$remainingSP*100,2),"% of SP (R2 = ",
round(lm_rsquared,3), ")"), #R squared of linear model fitting log values of 2 technical replicates
x=paste0("log10(replicate1 + 1)"), y=paste0("log10(replicate3 + 1)"))
print(ggExtra::ggMarginal(p, type = "histogram", size=7))Figure 4.5: Pairwise miRNA count comparison of first and third replicate with MIRV0.1 kit. Coefficient of determination is 0.917 (based on linear model that fits log10 values). Single positives are 0 in one replicate and > 0 in other. The cutoff of 6 removes 95.61% of single positives. Green dots show data points that are filtered out with this cutoff. (MIRV0.1: mirVana PARIS kit, 0.1 ml input)
ggsave("MIRV0.1_afterdedup.pdf",ggExtra::ggMarginal(p, type = "histogram", size=7), path= here::here("data_output","plots"), height=4.2, width=3.8, dpi = 300, useDingbats=F)
#write.table(x = dplyr::select(double_pos_sample, c("ensembl_gene_id", paste(sample1), paste(sample2))), file = "test_MIRV0.1_RNA1_3.txt",sep="\t",quote = F, na = "",row.names = F)If all counts smaller than or equal to cutoff are eliminated, at least 95% of single positives are removed, resulting in data that is highly reproducible
Cutoff is always higher for lower input volume within same kit (with lower input volume, there is more variation in which genes are detected in each replicate)
Cutoffs are close to each other except in the mirVana kits (MIRV & especially MIRVE). Note that 1 count difference can already have a major impact on the number of miRNAs filtered out
We use the median cutoff per kit-volume combination for filtering in further analyses (see table below)
Scoring: take the negative of the cutoff values (so that higher = better precision)
cutoff_kit <- single_pos %>% group_by(GraphKit) %>%
dplyr::summarise(median_th = median(filter_threshold))
p <- ggplot(single_pos, aes(x=GraphKit, y=filter_threshold, color=Replicates)) +
geom_point() +
theme_point +
theme(axis.text.x = element_text(angle=90, vjust=0.5, hjust=1))+
scale_color_manual(values=color_panel[3:5]) +
scale_y_continuous(limits = c(0,NA)) +
labs(y="95% SP cutoff (counts)", x="", color="replicate pairs")
print(p)Figure 4.6: Count threshold that removes 95% of single positives for each pairwise comparison of replicates. (CCF1: QIAamp ccfDNA/RNA kit, 1 ml input; CCF4: QIAamp ccfDNA/RNA kit, 4 ml input; CIRC0.25: plasma/serum circulating and exosomal RNA purification kit, 0.25 ml input; CIRC5: plasma/serum circulating and exosomal RNA purification kit, 5 ml input; MAX0.1: Maxwell RSC miRNA plasma and exosome kit, 0.1 ml input; MAX0.5: Maxwell RSC miRNA plasma and exosome kit, 0.5 ml input; MIR0.2: miRNeasy serum/plasma kit, 0.2 ml input; MIRA0.2: miRNeasy serum/plasma advanced kit, 0.2 ml input; MIRA0.6: miRNeasy serum/plasma advanced kit, 0.6 ml input; MIRV0.1: mirVana PARIS kit, 0.1 ml input; MIRV0.625: mirVana PARIS kit, 0.625 ml input; MIRVE0.1: mirVana PARIS kit with enrichment, 0.1 ml input; MIRVE0.625: mirVana PARIS kit with enrichment, 0.625 ml input; NUC0.3: NucleoSpin miRNA plasma kit, 0.3 ml; NUC0.9: NucleoSpin miRNA plasma kit, 0.9 ml input)
# normal z-score calculation
## calculate individual z-score for all data points
# first change the sign of CV to make sure higher = better
tmp_summary <- single_pos %>% group_by(GraphKit) %>% dplyr::summarize(threshold=median(filter_threshold)) %>%
mutate_if(is.numeric, funs(. * -1)) #make counts negative so that a higher metric value corresponds to a better performance
tmp <- cbind("GraphKit"=paste(tmp_summary$GraphKit),
"precision_threshold"=paste(scale(tmp_summary$threshold, center=T, scale=T))) %>%
as.tibble(.) %>%
type_convert(.) #convert columns with all numbers to type "dbl"
## add these to the z-score df
z_indiv <- left_join(z_indiv, tmp, by=c("GraphKit"))
rm(tmp)
# robust z-score calculation
## Calculate robust z scores to account for small sample sizes
robust_z <- robzscore(tmp_summary, "threshold")
tmp <- cbind(GraphKit = paste(tmp_summary$GraphKit), precision_threshold=robust_z) %>% as.tibble(.) %>% type_convert(.)
## add these to robust z-score df
z_indiv_robust <- left_join(z_indiv_robust, tmp, by=c("GraphKit"))
rm(tmp, robust_z)
rm(list=grep("pos|tmp|duplicates|test|single|nique|replicate|name",ls(),value=T))Figure 4.7: Impact of filtering (filter removes 95% of single positives). Left: % of total counts that are kept after applying filter; right: sum of counts that are not filtered out. (CCF1: QIAamp ccfDNA/RNA kit, 1 ml input; CCF4: QIAamp ccfDNA/RNA kit, 4 ml input; CIRC0.25: plasma/serum circulating and exosomal RNA purification kit, 0.25 ml input; CIRC5: plasma/serum circulating and exosomal RNA purification kit, 5 ml input; MAX0.1: Maxwell RSC miRNA plasma and exosome kit, 0.1 ml input; MAX0.5: Maxwell RSC miRNA plasma and exosome kit, 0.5 ml input; MIR0.2: miRNeasy serum/plasma kit, 0.2 ml input; MIRA0.2: miRNeasy serum/plasma advanced kit, 0.2 ml input; MIRA0.6: miRNeasy serum/plasma advanced kit, 0.6 ml input; MIRV0.1: mirVana PARIS kit, 0.1 ml input; MIRV0.625: mirVana PARIS kit, 0.625 ml input; MIRVE0.1: mirVana PARIS kit with enrichment, 0.1 ml input; MIRVE0.625: mirVana PARIS kit with enrichment, 0.625 ml input; NUC0.3: NucleoSpin miRNA plasma kit, 0.3 ml; NUC0.9: NucleoSpin miRNA plasma kit, 0.9 ml input)
# normal z-score calculation
## calculate individual z-score for all data points
tmp <- cbind("SampleID"=paste0(filter_df$GraphKit,"_",filter_df$TechnicalReplicate),
"GraphKit"=paste(filter_df$GraphKit),
"data_retention"=paste(scale(filter_df$percentage_countskept, center=T, scale=T))) %>% as.tibble(.) %>%
type_convert(.) #convert columns with all numbers to type "dbl"
## add these to the z-score df
z_indiv <- left_join(z_indiv, tmp, by=c("SampleID","GraphKit"))
rm(tmp)
# robust z-score calculation
## Calculate robust z scores to account for small sample sizes
## First for endogenous efficiency
robust_z <- robzscore(filter_df,"percentage_countskept")
tmp <- cbind(GraphKit = paste(filter_df$GraphKit), SampleID=paste0(filter_df$GraphKit,"_",filter_df$TechnicalReplicate), data_retention=robust_z) %>%
as.tibble(.) %>% type_convert(.)
## add these to robust z-score df
z_indiv_robust <- left_join(z_indiv_robust, tmp, by=c("SampleID","GraphKit"))
rm(tmp, robust_z)
## variable correlation
#cor_z_scores <- cor(z_indiv$concentration, z_indiv_robust$concentration, method = "spearman")
rm(list=grep("tmp|test|filter",ls(),value=T))
rm(p1,p2,p3)miRNAs_long <- miRNAs %>% gather(., key="UniqueID", value="counts", -"MIMATID") %>% #long format
left_join(., dplyr::select(sample_annotation,c(UniqueID,GraphKit,SampleID)), by="UniqueID") #add kit column
#keep only protein coding miR with more than 0 counts
miRNAs_long <- miRNAs_long %>% filter(counts > 0)
number_miR_detected <- miRNAs_long %>% group_by(SampleID) %>% dplyr::summarize(miR_above0=n())
number_miR_detected <- full_join(number_miR_detected,
miRNAs_long %>% group_by(SampleID) %>%
dplyr::summarize(total_est_counts_above0=sum(counts)),
by="SampleID")
#cutoff_kit <- single_pos %>% group_by(GraphKit) %>% dplyr::summarise(median_th = median(filter_threshold))
miRNAs_cutoff <- miRNAs %>% gather(., key="UniqueID", value="counts", -"MIMATID") %>% #long format
left_join(., dplyr::select(sample_annotation,c(UniqueID,GraphKit,SampleID)), by="UniqueID") %>% #add kit column
left_join(., cutoff_kit, by="GraphKit") #add the median cutoff for each kit
miRNAs_cutoff <- miRNAs_cutoff %>%
filter(counts > median_th)
number_miR_detected <- full_join(number_miR_detected,
miRNAs_cutoff %>% group_by(SampleID) %>%
dplyr::summarize(miR_aboveTh=n()),
by="SampleID")
number_miR_detected <- full_join(number_miR_detected,
miRNAs_cutoff %>% group_by(SampleID) %>%
dplyr::summarize(total_est_counts_aboveTh=sum(counts)),
by="SampleID")
number_miR_detected <- left_join(number_miR_detected,
dplyr::select(sample_annotation, c(GraphKit, RNAisolation, EluateV,SampleID, TechnicalReplicate)),
by="SampleID")
#convert to the original format
miRNAs_cutoff <- dplyr::select(miRNAs_cutoff, MIMATID, UniqueID, counts) %>%
spread(., key=UniqueID, value=counts)
#write.csv(miRNAs_cutoff, file="../../../exRNAQC/miRNAs_cutoff.csv",row.names=FALSE, na="")# Absolute nr of miR (no filter)
p1 <- ggplot(number_miR_detected,aes(x=reorder(GraphKit, desc(GraphKit)),y=miR_above0, col=RNAisolation))+
geom_point() +
theme_point+
theme(panel.grid.major.x=element_line(linetype = "dashed",color="gray88"), axis.text.x=element_text(size=8)) +
labs(x="", y="# miRNAs (all)") +
scale_y_continuous(limits = c(0, NA))+
scale_color_manual(values=color_panel2) +
coord_flip()
#theme(axis.text.x=element_text(angle=90, vjust=0.5, hjust=1))
#facet_wrap(~GraphKit, nrow = 1)
# Relative nr of miR (no filter)
test <- log_rescaling_CI(number_miR_detected, measurevar="miR_above0", groupvar="GraphKit",conf=0.95, log_base=10)
p2 <- ggplot(test, aes(x=GraphKit, y=mean_oriscale, colour=RNAisolation)) +
geom_point() +
geom_hline(yintercept = 1, linetype="dashed",colour="grey88") +
geom_errorbar(aes(ymin=ci_lower_oriscale, ymax=ci_upper_oriscale), width=.1) +
geom_line() +
labs(x="", y="relative # miRNAs",title="Relative number of miRNAs (unfiltered)") +
scale_colour_manual(values=color_panel2) +
scale_y_continuous(limits=c(0,NA)) +
theme_point +
theme(axis.text.x=element_text(angle=90, vjust=0.5, hjust=1))
# Absolute nr of miR (after 95% SP elimination cutoff)
p3 <- ggplot(number_miR_detected,aes(x=reorder(GraphKit, desc(GraphKit)),y=miR_aboveTh, col=RNAisolation))+
geom_point() +
theme_point+
theme(panel.grid.major.x=element_line(linetype = "dashed",color="gray88"), axis.text.x=element_text(size=8)) +
labs(x="", y="# miRNAs (reproducibly detected)") +
scale_y_continuous(limits = c(0, NA))+
scale_color_manual(values=color_panel2) +
coord_flip()
#theme(axis.text.x=element_text(angle=90, vjust=0.5, hjust=1))
#facet_wrap(~GraphKit, nrow = 1)
# Relative nr of miR (after 95% SP elimination cutoff)
test <- log_rescaling_CI(number_miR_detected, measurevar="miR_aboveTh", groupvar="GraphKit",conf=0.95, log_base=10)
p4 <- ggplot(test, aes(x=GraphKit, y=mean_oriscale, colour=RNAisolation)) +
geom_point() +
geom_hline(yintercept = 1, linetype="dashed",colour="grey88") +
geom_errorbar(aes(ymin=ci_lower_oriscale, ymax=ci_upper_oriscale), width=.1) +
geom_line() +
labs(x="", y="relative # miRNAs",title="Relative number of miRNAs (filtered)") +
scale_colour_manual(values=color_panel2) +
scale_y_continuous(limits=c(0,NA)) +
theme_point +
theme(axis.text.x=element_text(angle=90, vjust=0.5, hjust=1))
ggsave("miR_nofilter.pdf", plot=p1, path= here::here("data_output","plots"), height=4, width=6, dpi = 300, useDingbats=F)
ggsave("miR_filtered.pdf", plot=p3, path= here::here("data_output","plots"), height=4, width=6, dpi = 300, useDingbats=F)
write.table(spikes, "../data_raw/spikes_exRNAQC011.txt", quote = F, na = "",sep = "\t", row.names = F)Figure 4.8: Number of miRNAs that are detected per 1.5 M reads. Left: all miRNAs that are detected with at least one count; right: miRNAs that are reproducibly detected (≥ precision threshold that eliminates 95% of single positives). (CCF1: QIAamp ccfDNA/RNA kit, 1 ml input; CCF4: QIAamp ccfDNA/RNA kit, 4 ml input; CIRC0.25: plasma/serum circulating and exosomal RNA purification kit, 0.25 ml input; CIRC5: plasma/serum circulating and exosomal RNA purification kit, 5 ml input; MAX0.1: Maxwell RSC miRNA plasma and exosome kit, 0.1 ml input; MAX0.5: Maxwell RSC miRNA plasma and exosome kit, 0.5 ml input; MIR0.2: miRNeasy serum/plasma kit, 0.2 ml input; MIRA0.2: miRNeasy serum/plasma advanced kit, 0.2 ml input; MIRA0.6: miRNeasy serum/plasma advanced kit, 0.6 ml input; MIRV0.1: mirVana PARIS kit, 0.1 ml input; MIRV0.625: mirVana PARIS kit, 0.625 ml input; MIRVE0.1: mirVana PARIS kit with enrichment, 0.1 ml input; MIRVE0.625: mirVana PARIS kit with enrichment, 0.625 ml input; NUC0.3: NucleoSpin miRNA plasma kit, 0.3 ml; NUC0.9: NucleoSpin miRNA plasma kit, 0.9 ml input)
# normal z-score calculation
## calculate individual z-score for all data points
tmp <- cbind("SampleID"=paste(number_miR_detected$SampleID),
"GraphKit"=paste(number_miR_detected$GraphKit),
"miR_count"=paste(scale(number_miR_detected$miR_aboveTh, center=T, scale=T))) %>% as.tibble(.) %>%
type_convert(.) #convert columns with all numbers to type "dbl"
## add these to the z-score df
z_indiv <- left_join(z_indiv, tmp, by=c("SampleID","GraphKit"))
rm(tmp)
# robust z-score calculation
## Calculate robust z scores to account for small sample sizes
## First for endogenous efficiency
robust_z <- robzscore(number_miR_detected,"miR_aboveTh")
tmp <- cbind(GraphKit = paste(number_miR_detected$GraphKit), SampleID=paste(number_miR_detected$SampleID), miR_count=robust_z) %>%
as.tibble(.) %>% type_convert(.)
## add these to robust z-score df
z_indiv_robust <- left_join(z_indiv_robust, tmp, by=c("SampleID","GraphKit"))
rm(tmp, robust_z)
## variable correlation
#cor_z_scores <- cor(z_indiv$concentration, z_indiv_robust$concentration, method = "spearman")
rm(list=grep("tmp|test|number_genes",ls(),value=T))
rm(p,p1,p2,p3,p4)ALC (area-left-of-curve) as repeatability metric (see Mestdagh et al., 2014, Nature Methods). Compare two technical replicates at a time, only considering miRNAs that reach the precision threshold (which eliminates 95% of single positives) in at least one of the samples and ≥ 1 count in the other sample. Replace all zero counts by NA. Determine the absolute value of the log2 ratios of counts for each miRNA. Plotted are the (percentage) ranks of these absolute log2 ratios. The faster the curve reaches 100% (the smaller the ALC), the better. A small ALC indicates that the replicates give similar counts for each miRNA.
Scoring: lower ALC = better precision
double_positives<-miRNAs %>% replace(., is.na(.),0) %>% #first change NAs to 0
mutate_if(is.numeric, funs(round)) #round everything to nearest integer
double_positives<-double_positives[rowSums(double_positives %>% select_if(is.numeric) > 1, na.rm=T)>0,] # keep only the rows where at least one sample has more than 1
ALC_input_all <- data.frame()
for (duplicate_type in unique(sample_annotation$GraphKit)){
sample_duplicates<-sample_annotation%>%
filter(GraphKit==duplicate_type) %>%
dplyr::select(UniqueID,GraphKit,TechnicalReplicate)
cutoff95SP <- cutoff_kit[cutoff_kit$GraphKit==duplicate_type,]$median_th
if(nrow(sample_duplicates)>1){
#print(duplicate_type)
#double_positives_sample<-double_positives %>% dplyr::select(MIMATID,sample_duplicates$UniqueID) # only keep the replicates of one type
samples_comb <- combn(sample_duplicates$UniqueID,2) #compare 2 of the 3 samples at a time
for (n_col in 1:ncol(samples_comb)) {
#print(samples_comb[,n_col])
nr_runA <- gsub("^[A-Z]+","",sample_annotation[sample_annotation$UniqueID==
samples_comb[1,n_col],]$TechnicalReplicate)
nr_runB <- gsub("^[A-Z]+","",sample_annotation[sample_annotation$UniqueID==
samples_comb[2,n_col],]$TechnicalReplicate)
varname <- paste0("Rep",nr_runA,"_",nr_runB) #make a name so you can backtrace which replicates are compared
#print(varname)
double_positives_2repl <- double_positives %>% dplyr::select(MIMATID, paste(samples_comb[1,n_col]), paste0(samples_comb[2,n_col])) %>%
filter_if(is.numeric, all_vars(.>0)) %>% #keep only the 2 replicates of one kit and remove genes where not at least 1 of them has > 0 counts
filter_if(is.numeric, any_vars(.>cutoff95SP)) # remove genes where neither of the replicates reach the threshold that removes 95% of SP in that kit
correlation_samples<-double_positives_2repl %>%
mutate(log2_ratio=abs(log(get(samples_comb[1,n_col]),2)-log(get(samples_comb[2,n_col]),2))) %>%
dplyr::select(log2_ratio,MIMATID) #%>% drop_na()
ALC_input<-correlation_samples %>% arrange(log2_ratio) %>% # order by log2 ratio and then make a rank (counter) and rescale this to 1 (perc_counter)
#mutate(rank=percent_rank(log2_ratio)) %>% # this does not work: gives everything with log2ratio = 0 rank 0
mutate(counter = seq(1:nrow(double_positives_2repl)), GraphKit=duplicate_type, Replicates=varname) %>%
mutate(ReplID = paste0(GraphKit,"-",Replicates), perc_counter = counter/nrow(double_positives_2repl))
ALC_input_all <- rbind(ALC_input_all, ALC_input)
}
}
}
max_ALC <-max(ALC_input_all$log2_ratio) # calculate the max ALC over everything (necessary for area calculation -> should always be the same in order to compare among kits)
ALC <- ALC_input_all %>% group_by(GraphKit,Replicates) %>%
dplyr::summarise(ALC_calc = sum(log2_ratio)/(max_ALC*length(MIMATID))) %>%
#dplyr::summarise(ALC_calc = sum(log2_ratio)/(max_ALC)) %>%
mutate(ReplID = paste0(GraphKit,"-",Replicates))
for (replicates in unique(ALC_input_all$ReplID)) { # plot the ALC (= the colored part of the plot)
print(ggplot(ALC_input_all %>% dplyr::filter(ReplID==replicates) %>%
mutate(log2_ratio_resc = log2_ratio/max_ALC))+
geom_line(aes(x=log2_ratio_resc,y=perc_counter))+
#facet_wrap(~ReplID) +
geom_ribbon(aes(x=log2_ratio_resc,ymin=perc_counter,ymax=1), fill=color_panel1[gsub("-.*$","",replicates)])+
geom_hline(aes(yintercept = 1))+
theme_classic()+
scale_x_continuous(limits=c(0,1)) +
scale_y_continuous(expand = c(0, 0)) +
theme(legend.position = "none") +
labs(title=paste(replicates),
subtitle=paste("ALC:", round(ALC[ALC$ReplID==replicates,]$ALC_calc,2)), #print ALC for this particular comparison!
y="rank percentile",x="rescaled log2 ratio"))
}ALC <- ALC %>% arrange(GraphKit,ReplID) %>% mutate(Repl=c("RNA1","RNA3","RNA2")) %>% mutate(tmp_SampleID=paste0(GraphKit,"_",Repl)) ## just for easy integration later on order according to Graphkit and replicates (so first Repl1_2, then Repl1_3, then Repl2_3) and add a sampleID column like in other dfs
#remark: if you have more or less replicates, change mutate(Repl) part above accordingly
ALC <- left_join(ALC, sample_annotation[,c("SampleID","RNAisolation")], by=c("tmp_SampleID"="SampleID"))
ALCplot <- ggplot(ALC %>% drop_na())+
geom_point(aes(y=ALC_calc,x=reorder(GraphKit,dplyr::desc(GraphKit)),color=RNAisolation),size=1.3)+
#geom_text_repel(aes(y=value,x=GraphKit), nudge_x=0.1)+
theme_point +
labs(y="ALC",x="")+
#theme(panel.grid.major.y=element_line(linetype = "dashed",color="lightgray"))+
scale_color_manual(values=color_panel2) +
theme(legend.position = "none") +
scale_y_continuous(limits=c(0,NA))
#pdf(file=here::here("data_output","Kits_mRNA_ALC.pdf"), height=4, width=6)
ALCplot + coord_flip()Figure 4.9: Precision based on ALCs (area-left-of-curve). The lower the ALC, the better (less difference between replicates). (CCF1: QIAamp ccfDNA/RNA kit, 1 ml input; CCF4: QIAamp ccfDNA/RNA kit, 4 ml input; CIRC0.25: plasma/serum circulating and exosomal RNA purification kit, 0.25 ml input; CIRC5: plasma/serum circulating and exosomal RNA purification kit, 5 ml input; MAX0.1: Maxwell RSC miRNA plasma and exosome kit, 0.1 ml input; MAX0.5: Maxwell RSC miRNA plasma and exosome kit, 0.5 ml input; MIR0.2: miRNeasy serum/plasma kit, 0.2 ml input; MIRA0.2: miRNeasy serum/plasma advanced kit, 0.2 ml input; MIRA0.6: miRNeasy serum/plasma advanced kit, 0.6 ml input; MIRV0.1: mirVana PARIS kit, 0.1 ml input; MIRV0.625: mirVana PARIS kit, 0.625 ml input; MIRVE0.1: mirVana PARIS kit with enrichment, 0.1 ml input; MIRVE0.625: mirVana PARIS kit with enrichment, 0.625 ml input; NUC0.3: NucleoSpin miRNA plasma kit, 0.3 ml; NUC0.9: NucleoSpin miRNA plasma kit, 0.9 ml input)
#ALCplot + theme(axis.text.x = element_text(angle=90, vjust=0.5, hjust=1))
#dev.off()
ggsave("ALC_smallRNA.pdf", plot=ALCplot + coord_flip(), path= here::here("data_output","plots"), height=4, width=6, dpi = 300, useDingbats=F)# normal z-score calculation
## calculate individual z-score for all data points
tmp <- cbind("SampleID"=paste(ALC$tmp_SampleID),
"GraphKit"=paste(ALC$GraphKit),
"neg_ALC" = as.numeric(paste(ALC$ALC_calc)) * (-1)) %>% #take negative value of ALC (so that higher = better)
as_tibble() %>% type_convert() %>%
mutate("precision_ALC" = scale(neg_ALC, center=T, scale=T))
## add these to the z-score df
z_indiv <- left_join(z_indiv, tmp %>% dplyr::select(-"neg_ALC"), by=c("SampleID","GraphKit"))
# robust z-score calculation
## Calculate robust z scores to account for small sample sizes
robust_z <- robzscore(tmp, "neg_ALC")
tmp2 <- cbind(GraphKit = paste(tmp$GraphKit), SampleID=paste(tmp$SampleID), "precision_ALC"=robust_z) %>% as.tibble(.) %>% type_convert(.)
## add these to robust z-score df
z_indiv_robust <- left_join(z_indiv_robust, tmp2, by=c("SampleID","GraphKit"))
rm(tmp, tmp2, robust_z)
## variable correlation
#cor_z_scores <- cor(z_indiv$concentration, z_indiv_robust$concentration, method = "spearman")
rm(ALC)
rm(list=grep("tmp|test",ls(),value=T))Based on robust z-scores (for each performance metric: higher robust z-score is better)
z_score_df_test <- z_indiv_robust %>% drop_na() %>% as.tibble() %>%
dplyr::select(-c("SampleID","GraphKit"))
colnames(z_score_df_test) <- gsub("_"," ", colnames(z_score_df_test))
cor_z_scores <- cor(z_score_df_test %>% type_convert(.), method = "spearman")
require("RColorBrewer")
require("corrplot")
#pdf(here::here("data_output/plots/metric_corr_indiv_smallRNA.pdf"), height = 6, width=6, useDingbats=F)
corrplot(cor_z_scores, order="hclust", type="upper",
hclust.method="complete",
tl.srt = 45, #text labels rotated 45 degrees
method="color", number.cex=0.8,
addCoef.col = "black", # Add coefficient of correlation
tl.col = "black", #text label color
tl.cex = 0.8, #text label size
#cl.pos = "n", #position of color labels ("n"=none)
cl.align.text="l", #align color labels to the left
cl.cex=0.7, #color label size
col=colorRampPalette(brewer.pal(8,"RdYlBu"))(100))Figure 4.10: Correlation among metrics based on robust z-scores. Spearman correlation with complete hierarchical clustering. (concentration: relative endogenous miRNA concentration based LP spikes; data retention: % of counts left after applying precision threshold; efficiency: yield corrected for plasma input volume ; miR count: number of miRNAs after applying precision threshold; precision ALC: precision based on area-left-of-curve; precision threshold: count threshold to filter out 95% of single positives between replicates; yield: concentration corrected for eluate volume)
tmp <- z_indiv_robust %>% drop_na() %>% arrange(GraphKit) %>%
dplyr::group_by(GraphKit) %>% summarise_if(is.numeric, mean) %>%
left_join(unique(dplyr::select(sample_annotation, c("GraphKit"#,"input_volume"="PlasmaInputV"))))%>%
)))) %>%
column_to_rownames("GraphKit")
colnames(tmp) <- gsub("_"," ",colnames(tmp))
#add an average z-score row
tmp2 <- tmp %>% rowwise() %>% dplyr::summarise(average = mean(c_across(where(is.numeric))))
# annotation_row <- unique(data.frame(GraphKit=sample_annotation$GraphKit)) %>% #input_volume=sample_annotation$PlasmaInputV, eluate_volume=sample_annotation$EluateV))
# as_tibble()
# annotation_row$type <- as.factor(ifelse((str_detect(pattern="MAP|MAX", string=paste(annotation_row$GraphKit))), "automated", "manual"))
# annotation_row <- annotation_row %>% column_to_rownames("GraphKit")
# annotation_row_color <- list(type = c(manual="white", automated="grey"))
require(pheatmap)
#this one does not rescale values within one measure variable
#pdf(here::here("data_output/plots/overview_performance_withnrs.pdf"), height = 3.5, width=7, useDingbats=F)
custom_palette = colorRampPalette(rev(brewer.pal(n = 7, name =
"RdBu")))(100)
#make sure the middle color is positioned around 0
myBreaks <- c(seq(min(tmp), 0, length.out=ceiling(length(custom_palette)/2) + 1),
seq(max(tmp)/length(custom_palette), max(tmp),
length.out=floor(length(custom_palette)/2)))
# define the annotation
annotation_row = data.frame(
average = c(tmp2$average))
rownames(annotation_row) = rownames(tmp)
ann_colors = list(
average = c("white", "black"))
pheatmap::pheatmap(t(tmp),
#color= colorRampPalette(rev(brewer.pal(n = 7, name ="RdBu")))(100), #default
color = custom_palette,
breaks = myBreaks,
#color = colorRampPalette(brewer.pal(n=10, "Greys"))(10),
#scale = "row",
#cluster_rows = F,
annotation_col= annotation_row,
display_numbers = T, number_format = "%.1f", fontsize_number=8, number_color="black",
annotation_colors = ann_colors[1])Figure 4.11: Comparison of kit performance based on robust z-scores. Higher means a better performance. Complete hierarchical clustering based on Euclidean distance. (CCF1: QIAamp ccfDNA/RNA kit, 1 ml input; CCF4: QIAamp ccfDNA/RNA kit, 4 ml input; CIRC0.25: plasma/serum circulating and exosomal RNA purification kit, 0.25 ml input; CIRC5: plasma/serum circulating and exosomal RNA purification kit, 5 ml input; MAX0.1: Maxwell RSC miRNA plasma and exosome kit, 0.1 ml input; MAX0.5: Maxwell RSC miRNA plasma and exosome kit, 0.5 ml input; MIR0.2: miRNeasy serum/plasma kit, 0.2 ml input; MIRA0.2: miRNeasy serum/plasma advanced kit, 0.2 ml input; MIRA0.6: miRNeasy serum/plasma advanced kit, 0.6 ml input; MIRV0.1: mirVana PARIS kit, 0.1 ml input; MIRV0.625: mirVana PARIS kit, 0.625 ml input; MIRVE0.1: mirVana PARIS kit with enrichment, 0.1 ml input; MIRVE0.625: mirVana PARIS kit with enrichment, 0.625 ml input; NUC0.3: NucleoSpin miRNA plasma kit, 0.3 ml; NUC0.9: NucleoSpin miRNA plasma kit, 0.9 ml input; concentration: relative endogenous miRNA concentration based LP spikes; data retention: % of counts left after applying precision threshold; efficiency: yield corrected for plasma input volume ; miR count: number of miRNAs after applying precision threshold; precision ALC: precision based on area-left-of-curve; precision threshold: count threshold to filter out 95% of single positives between replicates; yield: concentration corrected for eluate volume; average: average z score over all metrics)
#dev.off()
#ggsave("Overview_performance.pdf", plot=ggplot2::last_plot(), path= here::here("data_output","plots"), height=5, width=6, dpi = 300, useDingbats=F)tmp_resc <- apply(tmp, 2, rescale) # rescale all metrics (columns) to values between 0 and 1
annotation_row = data.frame(
average = rescale(c(tmp2$average), to=c(0,1)))
rownames(annotation_row) = rownames(tmp)
ann_colors = list(
average = c("white", "black"))
#pdf(here::here("data_output/plots/overview_performance_rescaled.pdf"), height = 3.5, width=7, useDingbats=F)
pheatmap(t(tmp_resc),
#color = colorRampPalette(brewer.pal(n=10, "Greys"))(10),
color = custom_palette,
#scale = "column",
#cluster_rows = F,
annotation_col= annotation_row,
#treeheight_col = 0,
annotation_colors = ann_colors[1])Figure 4.12: Comparison of kit performance based on rescaled robust z-scores. Robust z-scores were rescaled to range of [0,1] to emphasize differences within a metric. Closer to 1 means a better performance. Complete hierarchical clustering based on Euclidean distance. (CCF1: QIAamp ccfDNA/RNA kit, 1 ml input; CCF4: QIAamp ccfDNA/RNA kit, 4 ml input; CIRC0.25: plasma/serum circulating and exosomal RNA purification kit, 0.25 ml input; CIRC5: plasma/serum circulating and exosomal RNA purification kit, 5 ml input; MAX0.1: Maxwell RSC miRNA plasma and exosome kit, 0.1 ml input; MAX0.5: Maxwell RSC miRNA plasma and exosome kit, 0.5 ml input; MIR0.2: miRNeasy serum/plasma kit, 0.2 ml input; MIRA0.2: miRNeasy serum/plasma advanced kit, 0.2 ml input; MIRA0.6: miRNeasy serum/plasma advanced kit, 0.6 ml input; MIRV0.1: mirVana PARIS kit, 0.1 ml input; MIRV0.625: mirVana PARIS kit, 0.625 ml input; MIRVE0.1: mirVana PARIS kit with enrichment, 0.1 ml input; MIRVE0.625: mirVana PARIS kit with enrichment, 0.625 ml input; NUC0.3: NucleoSpin miRNA plasma kit, 0.3 ml; NUC0.9: NucleoSpin miRNA plasma kit, 0.9 ml input; concentration: relative endogenous miRNA concentration based LP spikes; data retention: % of counts left after applying precision threshold; efficiency: yield corrected for plasma input volume ; miR count: number of miRNAs after applying precision threshold; precision ALC: precision based on area-left-of-curve; precision threshold: count threshold to filter out 95% of single positives between replicates; yield: concentration corrected for eluate volume; average: average z score over all metrics)
The RNA isolation for phase 2 is based on robust z-score transformed metric for sensitivity (# detected genes, see Number of miRNAs) and for precision (ALC, see Area-left-of-curve). Higher z-score = better
We looked at both metrics but in close calls, the sensitivity was given a higher weight. We also wanted to include at least one kit which is able to purify miRNA in case you have less than 1 ml of plasma as it is not always possible to collect or use multiple ml.
Selection: Maxwell (MAX0.5) & miRNeasy serum/plasma advanced (MIRA0.6)
# ggplot(z_indiv, aes(x=gene_count, y=precision_ALC, col=GraphKit)) + geom_point() + theme_point + ggrepel::geom_text_repel(aes(label=GraphKit)) + scale_colour_manual(values=color_panel1) + theme(legend.position="none")
#
# z_indiv_med <- z_indiv %>% group_by(GraphKit) %>% summarise_if(is.numeric, median, na.rm=T)
#
# #pdf("data_output/selection_kits_mRNA_regularz.pdf", height=5, width=6)
# ggplot(z_indiv_med, aes(x=gene_count, y=precision_ALC, col=GraphKit)) + geom_point() + theme_point + ggrepel::geom_text_repel(aes(label=GraphKit)) + scale_colour_manual(values=color_panel1) + theme(legend.position="none") + labs(x="sensitivity (gene count)", y="precision (ALC)", subtitle="Regular z-scores (median per kit)")
#dev.off()
z_indiv_robust_med <- z_indiv_robust %>% group_by(GraphKit) %>% summarise_if(is.numeric, median, na.rm=T) %>% left_join(unique(sample_annotation %>% dplyr::select(c("GraphKit", "RNAisolation"))))
#pdf("data_output/selection_kits_mRNA_robustz.pdf", height=5, width=6)
ggplot(z_indiv_robust_med, aes(x=miR_count, y=precision_ALC, col=RNAisolation)) +
geom_point() + theme_point +
ggrepel::geom_text_repel(aes(label=GraphKit)) +
scale_colour_manual(values=color_panel2) +
#theme(legend.position="none") +
scale_x_continuous(limits=c(-3.2,3.2)) + scale_y_continuous(limits = c(-3.2,3.2)) +
theme(legend.position = "none") +
labs(x="sensitivity (miRNAs)", y="precision (ALC)")Figure 5.1: Robust z-scores (median per RNA isolation kit) for sensitivity (number of miRNAs) (x) and precision (ALC, area-left-of-curve) (y). MIRA0.6 and MAX0.5 kits are selected for phase II. (CCF1: QIAamp ccfDNA/RNA kit, 1 ml input; CCF4: QIAamp ccfDNA/RNA kit, 4 ml input; CIRC0.25: plasma/serum circulating and exosomal RNA purification kit, 0.25 ml input; CIRC5: plasma/serum circulating and exosomal RNA purification kit, 5 ml input; MAX0.1: Maxwell RSC miRNA plasma and exosome kit, 0.1 ml input; MAX0.5: Maxwell RSC miRNA plasma and exosome kit, 0.5 ml input; MIR0.2: miRNeasy serum/plasma kit, 0.2 ml input; MIRA0.2: miRNeasy serum/plasma advanced kit, 0.2 ml input; MIRA0.6: miRNeasy serum/plasma advanced kit, 0.6 ml input; MIRV0.1: mirVana PARIS kit, 0.1 ml input; MIRV0.625: mirVana PARIS kit, 0.625 ml input; MIRVE0.1: mirVana PARIS kit with enrichment, 0.1 ml input; MIRVE0.625: mirVana PARIS kit with enrichment, 0.625 ml input; NUC0.3: NucleoSpin miRNA plasma kit, 0.3 ml; NUC0.9: NucleoSpin miRNA plasma kit, 0.9 ml input)